Abstract

We consider the least-square linear regression problem with regularization
by the $\ell_1$-norm, a problem usually referred to as the Lasso. In this
paper, we present a detailed asymptotic analysis of model consistency of 
the Lasso. For various decays of the regularization parameter, we compute 
asymptotic equivalents of the probability of correct model selection (i.e., 
variable selection). For a specific rate decay, we show that the Lasso se- 
lects all the variables that should enter the model with probability tending 
to one exponentially fast, while it selects all other variables with strictly 
positive probability. We show that this property implies that if we run 
the Lasso for several bootstrapped replications of a given sample, then 
intersecting the supports of the Lasso bootstrap estimates leads to consis- 
tent model selection. This novel variable selection algorithm, referred to 
as the Bolasso, is compared favorably to other linear regression methods 
on synthetic data and datasets from the UCI machine learning repository. 



1 Introduction 



Regularization by the $\ell_1$-norm has attracted a lot of interest in recent years in
machine learning, statistics and signal processing. In the context of least-square 
linear regression, the problem is usually referred to as the Lasso (Tibshirani, 
1994). Much of the early effort has been dedicated to algorithms to solve the 
optimization problem efficiently. In particular, the Lars algorithm of Efron et al.
(2004) finds the entire regularization path (i.e., the set of solutions for all
values of the regularization parameter) at the cost of a single matrix inversion.

Moreover, a well-known justification of the regularization by the $\ell_1$-norm
is that it leads to sparse solutions, i.e., loading vectors with many zeros, and
thus performs model selection. Recent works (Zhao & Yu, 2006; Yuan & Lin,
2007; Zou, 2006; Wainwright, 2006) have looked precisely at the model
consistency of the Lasso, i.e., if we know that the data were generated from a
sparse loading vector, does the Lasso actually recover it when the number of
observed data points grows? In the case of a fixed number of covariates, the
Lasso does recover the sparsity pattern if and only if a certain simple condition
on the generating covariance matrices is satisfied (Yuan & Lin, 2007). In
particular, in low correlation settings, the Lasso is indeed consistent. However,
in the presence of strong correlations, the Lasso cannot be consistent, shedding
light on potential problems of such procedures for variable selection. Adaptive
versions where data-dependent weights are added to the $\ell_1$-norm then make
it possible to retain consistency in all situations (Zou, 2006).

In this paper, we first derive a detailed asymptotic analysis of sparsity pat- 
tern selection of the Lasso estimation procedure, that extends previous analy- 
sis (Zhao & Yu, 2006; Yuan & Lin, 2007; Zou, 2006), by focusing on a specific 
decay of the regularization parameter. We show that when the decay is
proportional to $n^{-1/2}$, where n is the number of observations, then the Lasso will
select all the variables that should enter the model (the relevant variables) with 
probability tending to one exponentially fast with n, while it selects all other 
variables (the irrelevant variables) with strictly positive probability. If several 
datasets generated from the same distribution were available, then the latter 
property would suggest considering the intersection of the supports of the Lasso
estimates for each dataset: all relevant variables would always be selected for 
all datasets, while irrelevant variables would enter the models randomly, and 
intersecting the supports from sufficiently many different datasets would simply 
eliminate them. However, in practice, only one dataset is given; resampling
methods such as the bootstrap are precisely designed to mimic the availability
of several datasets by resampling from the same unique dataset (Efron &
Tibshirani, 1998). In this paper, we show that when using the bootstrap and
intersecting the supports, we actually get a consistent model estimate, without
the consistency condition required by the regular Lasso. We refer to this
new procedure as the Bolasso (bootstrap-enhanced least absolute shrinkage
operator). Finally, our Bolasso framework could be seen as a voting scheme
applied to the supports of the bootstrap Lasso estimates; however, our procedure
may rather be considered as a consensus combination scheme, as we keep the
(largest) subset of variables on which all regressors agree in terms of variable
selection, which is in our case provably consistent and also removes the need
for a potential additional hyperparameter.

The paper is organized as follows: in Section 2, we present the asymptotic 
analysis of model selection for the Lasso; in Section 3, we describe the Bolasso 
algorithm as well as its proof of model consistency, while in Section 4, we illus- 
trate our results on synthetic data, where the true sparse generating model is 
known, and data from the UCI machine learning repository. Sketches of proofs 
can be found in Appendix A. 
Notations. For a vector $v \in \mathbb{R}^p$, we denote by $\|v\|_2 = (v^\top v)^{1/2}$ the $\ell_2$-norm,
$\|v\|_\infty = \max_{i \in \{1,\dots,p\}} |v_i|$ the $\ell_\infty$-norm and $\|v\|_1 = \sum_{i=1}^p |v_i|$ the $\ell_1$-norm. For
$a \in \mathbb{R}$, sign(a) denotes the sign of a, defined as sign(a) = 1 if a > 0, -1 if a < 0,
and 0 if a = 0. For a vector $v \in \mathbb{R}^p$, $\mathrm{sign}(v) \in \mathbb{R}^p$ denotes the vector of signs
of the elements of v.

Moreover, given a vector $v \in \mathbb{R}^p$ and a subset I of {1, ..., p}, $v_I$ denotes
the vector in $\mathbb{R}^{\mathrm{Card}(I)}$ of elements of v indexed by I. Similarly, for a matrix
$A \in \mathbb{R}^{p \times p}$, $A_{I,J}$ denotes the sub-matrix of A composed of elements of A whose
rows are in I and columns are in J.

2 Asymptotic Analysis of Model Selection for 
the Lasso 

In this section, we describe existing and new asymptotic results regarding the 
model selection capabilities of the Lasso. 

2.1 Assumptions 

We consider the problem of predicting a response $Y \in \mathbb{R}$ from covariates
$X = (X_1, \dots, X_p)^\top \in \mathbb{R}^p$. The only assumptions that we make on the joint
distribution $P_{XY}$ of (X, Y) are the following:

(A1) The cumulant generating functions $\mathbb{E}\exp(s\|X\|_2^2)$ and $\mathbb{E}\exp(sY^2)$ are
finite for some s > 0.

(A2) The joint matrix of second order moments $\mathbf{Q} = \mathbb{E}XX^\top \in \mathbb{R}^{p \times p}$ is
invertible.

(A3) $\mathbb{E}(Y|X) = X^\top \mathbf{w}$ and $\mathrm{var}(Y|X) = \sigma^2$ a.s. for some $\mathbf{w} \in \mathbb{R}^p$ and $\sigma \in \mathbb{R}_+^*$.

We denote by $\mathbf{J} = \{j, \mathbf{w}_j \neq 0\}$ the sparsity pattern of $\mathbf{w}$, by $\mathbf{s} = \mathrm{sign}(\mathbf{w})$
the sign pattern of $\mathbf{w}$, and by $\varepsilon = Y - X^\top \mathbf{w}$ the additive noise.¹ Note that our
assumption regarding cumulant generating functions is satisfied when X and $\varepsilon$
have compact support, and also when the densities of X and $\varepsilon$ have light tails.

We consider independent and identically distributed (i.i.d.) data $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$,
i = 1, ..., n, sampled from $P_{XY}$; the data are given in the form of
matrices $Y \in \mathbb{R}^n$ and $X \in \mathbb{R}^{n \times p}$.

Note that the i.i.d. assumption, together with (A1-3), are the simplest
assumptions for studying the asymptotic behavior of the Lasso; it is of course
of interest to allow more general assumptions, in particular a growing number of
variables p, more general random variables, etc. (see, e.g., Meinshausen and Yu
(2006)), which are outside the scope of this paper.

1 Throughout this paper, we use boldface fonts for population quantities. 
2.2 Lasso Estimation

We consider the square loss function $\frac{1}{2n}\sum_{i=1}^n (y_i - w^\top x_i)^2 = \frac{1}{2n}\|Y - Xw\|_2^2$ and
the regularization by the $\ell_1$-norm defined as $\|w\|_1 = \sum_{i=1}^p |w_i|$. That is, we
look at the following Lasso optimization problem (Tibshirani, 1994):

$$\min_{w \in \mathbb{R}^p} \ \frac{1}{2n}\|Y - Xw\|_2^2 + \mu_n \|w\|_1, \qquad (1)$$

where $\mu_n$ is the regularization parameter. We denote by $\hat{w}$ any global minimum
of Eq. (1); it may not be unique in general, but will be with probability tending
to one exponentially fast under assumption (A1).
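As an illustration of the objective in Eq. (1), the following is a minimal sketch of a cyclic coordinate descent solver. It is not the Lars algorithm used in the paper's simulations; the function names `soft_threshold` and `lasso_cd`, and the stopping parameters `n_iter` and `tol`, are assumptions introduced here for illustration only.

```python
import numpy as np

def soft_threshold(z, t):
    """Soft-thresholding operator: sign(z) * max(|z| - t, 0)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_cd(X, Y, mu, n_iter=500, tol=1e-10):
    """Minimize (1/2n)||Y - Xw||_2^2 + mu*||w||_1 by cyclic coordinate descent.

    Illustrative solver (an assumption of this sketch), not the Lars algorithm.
    """
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n          # diagonal of X^T X / n
    residual = Y - X @ w
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # correlation of column j with the residual that excludes coordinate j
            rho = X[:, j] @ residual / n + col_sq[j] * w[j]
            new_wj = soft_threshold(rho, mu) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                residual -= delta * X[:, j]    # keep the residual in sync with w
                w[j] = new_wj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return w
```

When the columns of X are orthogonal with $X^\top X / n = I$, the minimizer reduces to soft-thresholding of $X^\top Y / n$ at level $\mu_n$, which gives a convenient sanity check for such a solver.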

2.3 Model Consistency - General Results 

In this section, we detail the asymptotic behavior of the Lasso estimate $\hat{w}$, both
in terms of the difference in norm with the population value $\mathbf{w}$ (i.e., regular
consistency) and of the sign pattern $\mathrm{sign}(\hat{w})$, for all asymptotic behaviors of
the regularization parameter $\mu_n$. Note that information about the sign pattern
includes information about the support, i.e., the indices i for which $\hat{w}_i$ is different
from zero; moreover, when $\hat{w}$ is consistent, consistency of the sign pattern is in
fact equivalent to the consistency of the support.

We now consider five mutually exclusive possible situations which explain 
various portions of the regularization path (we assume (A1-3)); many of these
results appear elsewhere (Yuan & Lin, 2007; Zhao & Yu, 2006; Fu & Knight, 
2000; Zou, 2006; Bach, 2007) but some of the finer results presented below are 
new (see Section 2.4). 

1. If $\mu_n$ tends to infinity, then $\hat{w} = 0$ with probability tending to one.

2. If $\mu_n$ tends to a finite strictly positive constant $\mu_0$, then $\hat{w}$ converges in
probability to the unique global minimum of $\frac{1}{2}(w - \mathbf{w})^\top \mathbf{Q} (w - \mathbf{w}) + \mu_0 \|w\|_1$.
Thus, the estimate $\hat{w}$ never converges in probability to $\mathbf{w}$, while the sign
pattern tends to the one of the previous global minimum, which may or
may not be the same as the one of $\mathbf{w}$.²

3. If $\mu_n$ tends to zero slower than $n^{-1/2}$, then $\hat{w}$ converges in probability to $\mathbf{w}$
(regular consistency) and the sign pattern converges to the sign pattern of
the global minimum of $\frac{1}{2} v^\top \mathbf{Q} v + v_{\mathbf{J}}^\top \mathrm{sign}(\mathbf{w}_{\mathbf{J}}) + \|v_{\mathbf{J}^c}\|_1$. This sign pattern is
equal to the population sign vector $\mathbf{s} = \mathrm{sign}(\mathbf{w})$ if and only if the following
consistency condition is satisfied:

$$\|\mathbf{Q}_{\mathbf{J}^c \mathbf{J}} \mathbf{Q}_{\mathbf{J}\mathbf{J}}^{-1} \mathrm{sign}(\mathbf{w}_{\mathbf{J}})\|_\infty \le 1. \qquad (2)$$

Thus, if Eq. (2) is satisfied, the probability of correct sign estimation is
tending to one, and to zero otherwise (Yuan & Lin, 2007).

4. If $\mu_n = \mu_0 n^{-1/2}$ for $\mu_0 \in (0, \infty)$, then the sign pattern of $\hat{w}$ agrees on
$\mathbf{J}$ with the one of $\mathbf{w}$ with probability tending to one, while for all sign
patterns consistent on $\mathbf{J}$ with the one of $\mathbf{w}$, the probability of obtaining
this pattern is tending to a limit in (0, 1) (in particular strictly positive);
that is, all patterns consistent on $\mathbf{J}$ are possible with positive probability.
See Section 2.4 for more details.

5. If $\mu_n$ tends to zero faster than $n^{-1/2}$, then $\hat{w}$ is consistent (i.e., converges
in probability to $\mathbf{w}$) but the support of $\hat{w}$ is equal to {1, ..., p} with
probability tending to one (the signs of variables in $\mathbf{J}^c$ may be negative or
positive). That is, the $\ell_1$-norm has no sparsifying effect.

² Here and in the third regime, we do not take into account the pathological cases where
the sign pattern of the limit is unstable, i.e., the limit is exactly at a hinge point of the
regularization path.
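The consistency condition in Eq. (2) is straightforward to evaluate numerically when the population matrix and loading vector are known, as in synthetic experiments. The sketch below computes the left-hand side of Eq. (2); the function name `consistency_statistic` is hypothetical and the convention of returning 0 when the support or its complement is empty is an assumption of this sketch.

```python
import numpy as np

def consistency_statistic(Q, w):
    """Return ||Q_{J^c J} Q_{JJ}^{-1} sign(w_J)||_inf, with J the support of w.

    Eq. (2) holds when the returned value is at most 1.
    """
    Q = np.asarray(Q, dtype=float)
    w = np.asarray(w, dtype=float)
    J = np.flatnonzero(w != 0)       # relevant variables
    Jc = np.flatnonzero(w == 0)      # irrelevant variables
    if J.size == 0 or Jc.size == 0:
        return 0.0                   # convention of this sketch: nothing to check
    Q_JJ = Q[np.ix_(J, J)]
    Q_JcJ = Q[np.ix_(Jc, J)]
    v = Q_JcJ @ np.linalg.solve(Q_JJ, np.sign(w[J]))
    return float(np.max(np.abs(v)))
```

For instance, with two uncorrelated relevant variables that each have correlation 0.6 with a third, irrelevant variable, the statistic equals 1.2 when both loadings have the same sign (condition violated) and 0 when they have opposite signs.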

Among the five previous regimes, the only ones with consistent estimates (in
norm) and a sparsity-inducing effect are $\mu_n$ tending to zero and $\mu_n n^{1/2}$ tending
to a limit $\mu_0 \in (0, \infty]$ (i.e., potentially infinite). When $\mu_0 = +\infty$, then we can
only hope for model consistent estimates if the consistency condition in Eq. (2)
is satisfied. This somewhat disappointing result for the Lasso has led to various 
improvements on the Lasso to ensure model consistency even when Eq. (2) is not 
satisfied (Yuan & Lin, 2007; Zou, 2006). Those are based on adaptive weights 
based on the non regularized least-square estimate. We propose in Section 3 an 
alternative way which is based on resampling. 

In this paper, we now consider the specific case where $\mu_n = \mu_0 n^{-1/2}$ for
$\mu_0 \in (0, \infty)$, where we derive new asymptotic results. Indeed, in this situation,
we get the correct signs of the relevant variables (those in $\mathbf{J}$) with probability
tending to one, but we also get all possible sign patterns consistent with this,
i.e., all other variables (those not in $\mathbf{J}$) may be non zero with asymptotically
strictly positive probability. However, if we were to repeat the Lasso estimation
for many datasets obtained from the same distribution, we would obtain for
each $\mu_0$ a set of active variables, all of which include $\mathbf{J}$ with probability tending
to one, but potentially containing all other subsets. By intersecting those, we
would get exactly $\mathbf{J}$.

However, this requires multiple copies of the samples, which are not usually 
available. Instead, we consider bootstrapped samples which exactly mimic the 
behavior of having multiple copies. See Section 3 for more details. 

2.4 Model Consistency with Exact Root-n Regularization Decay

In this section we present detailed new results regarding the pattern consistency
for $\mu_n$ tending to zero exactly at rate $n^{-1/2}$ (see proofs in Appendix A):

Proposition 1 Assume (A1-3) and $\mu_n = \mu_0 n^{-1/2}$, $\mu_0 > 0$. Then for any sign
pattern $s \in \{-1, 0, 1\}^p$ such that $s_{\mathbf{J}} = \mathrm{sign}(\mathbf{w}_{\mathbf{J}})$, $P(\mathrm{sign}(\hat{w}) = s)$ tends to a limit
$\rho(s, \mu_0) \in (0, 1)$, and we have:

$$P(\mathrm{sign}(\hat{w}) = s) - \rho(s, \mu_0) = O(n^{-1/2} \log n).$$

Proposition 2 Assume (A1-3) and $\mu_n = \mu_0 n^{-1/2}$, $\mu_0 > 0$. Then, for any
pattern $s \in \{-1, 0, 1\}^p$ such that $s_{\mathbf{J}} \neq \mathrm{sign}(\mathbf{w}_{\mathbf{J}})$, there exists a constant $A(\mu_0) > 0$
such that

$$\log P(\mathrm{sign}(\hat{w}) = s) \le -n A(\mu_0) + O(n^{-1/2}).$$

The last two propositions state that we get all relevant variables with probability
tending to one exponentially fast, while we get all other patterns
with probability tending to a limit strictly between zero and one. Note that
the results that we give in this paper are valid for finite n, i.e., we could derive
actual bounds on probability of sign pattern selections with known constants
that explicitly depend on $\mathbf{w}$, $\mathbf{Q}$ and $P_{XY}$.

3 Bolasso: Bootstrapped Lasso 

Given the n i.i.d. observations $(x_i, y_i) \in \mathbb{R}^p \times \mathbb{R}$, i = 1, ..., n, given by matrices
$X \in \mathbb{R}^{n \times p}$ and $Y \in \mathbb{R}^n$, we consider m bootstrap replications of the n data
points (Efron & Tibshirani, 1998); that is, for k = 1, ..., m, we consider a
ghost sample $(x_i^k, y_i^k) \in \mathbb{R}^p \times \mathbb{R}$, i = 1, ..., n, given by matrices $X^k \in \mathbb{R}^{n \times p}$
and $Y^k \in \mathbb{R}^n$. The n pairs $(x_i^k, y_i^k)$, i = 1, ..., n, are sampled uniformly at
random with replacement from the n original pairs in (X, Y). The sampling
of the nm pairs of observations is independent. In other words, we define the
distribution of the ghost sample $(X^*, Y^*)$ by sampling n points with replacement
from (X, Y), and, given (X, Y), the m ghost samples are sampled
i.i.d. from the distribution of $(X^*, Y^*)$.

The asymptotic analysis from Section 2 suggests estimating the supports
$J_k = \{j, \hat{w}_j^k \neq 0\}$ of the Lasso estimates $\hat{w}^k$ for the bootstrap samples, k =
1, ..., m, and intersecting them to define the Bolasso model estimate of the
support: $J = \bigcap_{k=1}^m J_k$. Once J is selected, we estimate $\mathbf{w}$ by the unregularized
least-square fit restricted to variables in J. The detailed algorithm is given
in Algorithm 1. The algorithm has only one extra parameter (the number
of bootstrap samples m). Following Proposition 3, log(m) should be chosen
growing with n asymptotically slower than n. In simulations, we always use
m = 128 (except in Figure 3, where we specifically study the influence of m).
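The procedure just described (bootstrap, Lasso on each replication, intersection of supports, unregularized refit) can be sketched as follows. This is an illustrative implementation under stated assumptions: the inner Lasso solver is a plain coordinate descent written for this sketch rather than the Lars algorithm used in the paper, and the names `lasso_cd`, `bolasso` and the `seed` argument are hypothetical.

```python
import numpy as np

def lasso_cd(X, Y, mu, n_iter=200, tol=1e-8):
    """Coordinate descent for (1/2n)||Y - Xw||_2^2 + mu*||w||_1 (illustrative solver)."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    residual = Y.astype(float).copy()
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            rho = X[:, j] @ residual / n + col_sq[j] * w[j]
            new_wj = np.sign(rho) * max(abs(rho) - mu, 0.0) / col_sq[j]
            delta = new_wj - w[j]
            if delta != 0.0:
                residual -= delta * X[:, j]
                w[j] = new_wj
                max_change = max(max_change, abs(delta))
        if max_change < tol:
            break
    return w

def bolasso(X, Y, mu, m=128, seed=0):
    """Intersect Lasso supports over m bootstrap replications, then refit by least squares."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    support = np.ones(p, dtype=bool)
    for _ in range(m):
        idx = rng.integers(0, n, size=n)       # n pairs drawn with replacement
        w_k = lasso_cd(X[idx], Y[idx], mu)
        support &= (w_k != 0)                  # running intersection of the supports J_k
        if not support.any():
            break                              # the intersection is already empty
    J = np.flatnonzero(support)
    w = np.zeros(p)
    if J.size > 0:
        # unregularized least-square fit restricted to the selected variables
        w[J] = np.linalg.lstsq(X[:, J], Y, rcond=None)[0]
    return J, w
```

A typical call would use `mu` proportional to $n^{-1/2}$, the regime analyzed in Section 2.4, for example `bolasso(X, Y, mu=1.0 / np.sqrt(len(Y)))`.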

Note that in practice, the Bolasso estimate can be computed simultaneously
for a large number of regularization parameters because of the efficiency of the
Lars algorithm (which we use in simulations), which finds the entire
regularization path for the Lasso at the (empirical) cost of a single matrix
inversion (Efron et al., 2004). Thus the computational complexity of the Bolasso is
$O(m(p^3 + p^2 n))$.

Algorithm 1 Bolasso

  Input: data $(X, Y) \in \mathbb{R}^{n \times (p+1)}$
         number of bootstrap replicates m
         regularization parameter $\mu$
  for k = 1 to m do
    Generate bootstrap samples $(X^k, Y^k) \in \mathbb{R}^{n \times (p+1)}$
    Compute Lasso estimate $\hat{w}^k$ from $(X^k, Y^k)$
    Compute support $J_k = \{j, \hat{w}_j^k \neq 0\}$
  end for
  Compute $J = \bigcap_{k=1}^m J_k$
  Compute $\hat{w}_J$ from $(X_J, Y)$

The following proposition (proved in Appendix A) shows that the previous
algorithm leads to consistent model selection.

Proposition 3 Assume (A1-3) and $\mu_n = \mu_0 n^{-1/2}$, $\mu_0 > 0$. Then, for all m > 0,
the probability that the Bolasso does not exactly select the correct model, i.e.,
$P(J \neq \mathbf{J})$, has the following upper bound:

$$P(J \neq \mathbf{J}) \le m A_1 e^{-A_2 n} + A_3 \frac{\log n}{n^{1/2}} + A_4 \frac{\log m}{m},$$

where $A_1, A_2, A_3, A_4$ are strictly positive constants.

Therefore, if log(m) tends to infinity slower than n when n tends to infinity,
the Bolasso asymptotically selects with overwhelming probability the correct
active variables, and by regular consistency of the restricted least-square
estimate, the correct sign pattern as well. Note that the previous bound is true
whether the condition in Eq. (2) is satisfied or not, but could be improved on if 
we suppose that Eq. (2) is satisfied. See Section 4.1 for a detailed comparison 
with the Lasso on synthetic examples. 

4 Simulations 

In this section, we illustrate the consistency results obtained in this paper with 
a few simple simulations on synthetic examples similar to the ones used by Bach 
(2007) and some medium scale datasets from the UCI machine learning reposi- 
tory (Asuncion & Newman, 2007). 

4.1 Synthetic examples 

For a given dimension p, we sampled $X \in \mathbb{R}^p$ from a normal distribution with
zero mean and covariance matrix generated as follows: (a) sample a p x p
matrix G with independent standard normal entries, (b) form $\mathbf{Q} = GG^\top$,
(c) scale $\mathbf{Q}$ to unit diagonal. We then selected the first $\mathrm{Card}(\mathbf{J}) = r$ variables
and sampled non zero loading vectors as follows: (a) sample each loading from
independent standard normal distributions, (b) rescale those to unit magnitude,
(c) rescale those by a scaling which is uniform at random between 1/3 and 1 (to
ensure $\min_{j \in \mathbf{J}} |\mathbf{w}_j| \ge 1/3$). Finally, we chose a constant noise level $\sigma$ equal
to 0.1 times $(\mathbb{E}(\mathbf{w}^\top X)^2)^{1/2}$, and the additive noise $\varepsilon$ is normally distributed
with zero mean and variance $\sigma^2$. Note that the joint distribution on (X, Y)
thus defined satisfies with probability one (with respect to the sampling of the
covariance matrix) assumptions (A1-3).

Figure 1: Lasso: log-odds ratios of the probabilities of selection of each variable
(white = large probabilities, black = small probabilities) vs. regularization
parameter. Consistency condition in Eq. (2) satisfied (left) and not satisfied
(right).

Figure 2: Bolasso: log-odds ratios of the probabilities of selection of each
variable (white = large probabilities, black = small probabilities) vs. regularization
parameter. Consistency condition in Eq. (2) satisfied (left) and not satisfied
(right).
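The synthetic data-generating procedure of Section 4.1 can be sketched as below. The function name `make_synthetic` and the `seed` argument are hypothetical, and two readings of the text are assumptions of this sketch: "rescale to unit magnitude" is taken to mean keeping only the sign of each loading, and the uniform scaling in [1/3, 1] is drawn independently for each loading.

```python
import numpy as np

def make_synthetic(p, r, n, seed=0):
    """Sample (X, Y) with r relevant variables out of p, following Section 4.1."""
    rng = np.random.default_rng(seed)
    # (a)-(c): random covariance matrix scaled to unit diagonal
    G = rng.standard_normal((p, p))
    Q = G @ G.T
    d = np.sqrt(np.diag(Q))
    Q = Q / np.outer(d, d)
    # loadings on the first r variables: random sign times a magnitude in [1/3, 1]
    w = np.zeros(p)
    loadings = rng.standard_normal(r)
    loadings /= np.abs(loadings)                 # unit magnitude (assumed: keep the sign)
    w[:r] = loadings * rng.uniform(1.0 / 3.0, 1.0, size=r)
    # noise level: 0.1 times (E (w^T X)^2)^{1/2} = 0.1 * sqrt(w^T Q w)
    sigma = 0.1 * np.sqrt(w @ Q @ w)
    # X ~ N(0, Q) through the Cholesky factor of Q
    L = np.linalg.cholesky(Q)
    X = rng.standard_normal((n, p)) @ L.T
    Y = X @ w + sigma * rng.standard_normal(n)
    return X, Y, w, Q, sigma
```

The returned `Q` and `w` are the population quantities, so the consistency condition in Eq. (2) can be evaluated exactly for each sampled distribution.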

In Figure 1, we sampled two distributions $P_{XY}$ with p = 16 and r = 8
relevant variables, one for which the consistency condition in Eq. (2) is satisfied
(left), one for which it is not satisfied (right). For a fixed number of samples
n = 1000, we generated 256 replications and computed the empirical frequencies
of selecting any given variable for the Lasso as the regularization parameter $\mu$
varies. Those plots show the various asymptotic regimes of the Lasso detailed
in Section 2. In particular, on the right plot, although no $\mu$ leads to perfect
selection (i.e., exactly variables with indices less than r = 8 are selected), there
is a range where all relevant variables are always selected, while all others are
selected with probability within (0, 1).

In Figure 2, we plot the results under the same conditions for the Bolasso
(with a fixed number of bootstrap replications m = 128). We can see that in the
Lasso-consistent case (left), the Bolasso widens the consistency region, while in
the Lasso-inconsistent case (right), the Bolasso "creates" a consistency region.







Figure 3: Bolasso (red, dashed) and Lasso (black, plain): probability of correct 
sign estimation vs. regularization parameter. Consistency condition in Eq. (2) 
satisfied (left) and not satisfied (right). The number of bootstrap replications 
m is in {2, 4, 8, 16, 32, 64, 128, 256}. 



In Figure 3, we selected the same two distributions and compared the prob- 
ability of exactly selecting the correct support pattern, for the Lasso, and for 
the Bolasso with varying numbers of bootstrap replications (those probabilities 
are computed by averaging over 256 experiments with the same distribution). 
In Figure 3, we can see that in the Lasso-inconsistent case (right), the
Bolasso indeed remedies the inability of the Lasso to find the correct pattern.
Moreover, increasing m always appears beneficial; note that although it seems to
contradict the asymptotic analysis in Section 3 (which imposes an upper bound
for consistency), this is due to the fact that not selecting (at least) the relevant
variables has very low probability and is not observed with only 256 replications.

Finally, in Figure 4, we compare various variable selection procedures for 
linear regression, to the Bolasso, with two distributions where p = 64, r = 8 
and varying n. For all the methods we consider, there is a natural way to 
select exactly r variables with no free parameters (for the Bolasso, we select 
the most stable pattern with r elements, i.e., the pattern which corresponds to 
most values of $\mu$). We can see that the Bolasso outperforms all other variable
selection methods, even in settings where the number of samples becomes of the 
order of the number of variables, which requires additional theoretical analysis, 
subject of ongoing research. Note in particular that we compare with bagging 
of least-square regression (Breiman, 1996a) followed by a thresholding of the 
loading vector, which is another simple way of using bootstrap samples: the 
Bolasso provides a more efficient way to use the extra information, not for usual 
stabilization purposes (Breiman, 1996b), but directly for model selection. Note 
finally, that the bagging of Lasso estimates requires an additional parameter 
and is thus not tested. 

Figure 4: Comparison of several variable selection methods: Lasso (black
circles), Bolasso (green crosses), forward greedy (magenta diamonds), thresholded
LS estimate (red stars), adaptive Lasso (blue pluses). Consistency condition in
Eq. (2) satisfied (left) and not satisfied (right). The averaged (over 32
replications) variable selection error is computed as the square distance between
sparsity pattern indicator vectors.

4.2 UCI datasets

The previous simulations have shown that the Bolasso is successful at
performing model selection in synthetic examples. We now apply it to several linear
regression problems and compare it to alternative methods for linear regression,
namely, ridge regression, Lasso, bagging of Lasso estimates (Breiman, 1996a),
and a soft version of the Bolasso (referred to as Bolasso-S), where instead of
intersecting the supports across bootstrap replications, we select those variables
which are present in at least 90% of the bootstrap replications. In Table 1, we consider
data randomly generated as in Section 4.1 (with p = 32, r = 8, n = 64), where
the true model is known to be composed of a sparse loading vector, while in
Table 2, we consider regression datasets from the UCI machine learning repository.
For all of those, we perform 10 replications of 10-fold cross validation and for all
methods (which all have one free regularization parameter), we select the best
regularization parameter on the 100 folds and report the mean square prediction
error and its standard deviation.
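The softened intersection used by Bolasso-S, and the variable selection error reported in Figure 4, both operate on support indicator vectors and can be sketched in a few lines. The function names `soft_intersection` and `selection_error`, and the representation of the bootstrap supports as an (m, p) boolean array, are assumptions of this sketch.

```python
import numpy as np

def soft_intersection(supports, threshold=0.9):
    """Keep variables selected in at least `threshold` of the bootstrap replications.

    supports: (m, p) boolean array, one row per bootstrap Lasso support.
    threshold=1.0 gives the strict intersection used by the Bolasso.
    """
    supports = np.asarray(supports, dtype=bool)
    frequency = supports.mean(axis=0)          # selection frequency of each variable
    return frequency >= threshold

def selection_error(selected, true_support):
    """Square distance between two sparsity pattern indicator vectors."""
    a = np.asarray(selected, dtype=float)
    b = np.asarray(true_support, dtype=float)
    return float(np.sum((a - b) ** 2))
```

Since the indicator vectors are binary, the selection error is simply the number of variables on which the estimated and true supports disagree.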

Note that when the generating model is actually sparse (Table 1), the Bolasso 
outperforms all other models, while in other cases (Table 2) the Bolasso is 
sometimes too strict in intersecting models, i.e., the softened version works 
better and is competitive with other methods. Studying the effects of this 
softened scheme (which is more similar to usual voting schemes), in particular in
terms of the potential trade-off between good model selection and low prediction 
error, and under conditions where p is large, is the subject of ongoing work. 

Table 1: Comparison of least-square estimation methods, data generated as
described in Section 4.1, with $\kappa = \|\mathbf{Q}_{\mathbf{J}^c\mathbf{J}} \mathbf{Q}_{\mathbf{J}\mathbf{J}}^{-1} \mathbf{s}_{\mathbf{J}}\|_\infty$ (cf. Eq. (2)). Performance
is measured through mean squared prediction error (multiplied by 100).

  $\kappa$        0.93         1.20         1.42         1.28
  Ridge      8.8 ± 4.5    4.9 ± 2.5    7.3 ± 3.9    8.1 ± 8.6
  Lasso      7.6 ± 3.8    4.4 ± 2.3    4.7 ± 2.5    5.1 ± 6.5
  Bolasso    5.4 ± 3.0    3.4 ± 2.4    3.4 ± 1.7    3.7 ± 10.2
  Bagging    7.8 ± 4.7    4.6 ± 3.0    5.4 ± 4.1    5.8 ± 8.4
  Bolasso-S  5.7 ± 3.8    3.0 ± 2.3    3.1 ± 2.8    3.2 ± 8.2

Table 2: Comparison of least-square estimation methods, UCI regression
datasets. Performance is measured through mean squared prediction error
(multiplied by 100).

             Autompg      Imports      Machine      Housing
  Ridge      18.6 ± 4.9   7.7 ± 4.8    5.8 ± 18.6   28.0 ± 5.9
  Lasso      18.6 ± 4.9   7.8 ± 5.2    5.8 ± 19.8   28.0 ± 5.7
  Bolasso    18.1 ± 4.7   20.7 ± 9.8   4.6 ± 21.4   26.9 ± 2.5
  Bagging    18.6 ± 5.0   8.0 ± 5.2    6.0 ± 18.9   28.1 ± 6.6
  Bolasso-S  17.9 ± 5.0   8.2 ± 4.9    4.6 ± 19.9   26.8 ± 6.4

5 Conclusion

We have presented a detailed analysis of variable selection properties of a
bootstrapped version of the Lasso. The model estimation procedure, referred to
as the Bolasso, is provably consistent under general assumptions. This work
brings to light that poor variable selection results of the Lasso may be easily
enhanced thanks to a simple parameter-free resampling procedure. Our
contribution also suggests that the use of bootstrap samples by L. Breiman in
Bagging/Arcing/Random Forests (Breiman, 1998) may have been so far slightly
overlooked and considered a minor feature, while using bootstrap samples may
actually be a key computational feature in such algorithms for good model
selection performance, and eventually good prediction performance on real
datasets.

The current work could be extended in various ways: first, we have focused 
on a fixed total number of variables, and allowing the numbers of variables to 
grow is important in theory and in practice (Meinshausen & Yu, 2006). Second, 
the same technique can be applied to settings similar to least-square regression
with the $\ell_1$-norm, namely regularization by block $\ell_1$-norms (Bach, 2007)
and other losses such as general convex classification losses. Finally,
theoretical and practical connections could be made with other work on resampling
methods and boosting (Buhlmann, 2006).

A Proof of Model Consistency Results 

In this appendix, we give sketches of proofs for the asymptotic results presented 
in Section 2 and Section 3. The proofs rely on the well-known property of the 
Lasso optimization problems, namely that if the sign pattern of the solution is 
known, then we can get the solution in closed form.

A.1 Optimality Conditions

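We denote $\varepsilon = Y - X\mathbf{w} \in \mathbb{R}^n$, $Q = X^\top X / n \in \mathbb{R}^{p \times p}$ and $q = X^\top \varepsilon / n \in \mathbb{R}^p$.
First, we can equivalently rewrite Eq. (1) as:

$$\min_{w \in \mathbb{R}^p} \ \frac{1}{2}(w - \mathbf{w})^\top Q (w - \mathbf{w}) - q^\top (w - \mathbf{w}) + \mu_n \|w\|_1. \qquad (3)$$

The optimality conditions for Eq. (3) can be written in terms of the sign pattern
$s = s(w) = \mathrm{sign}(w)$ and the sparsity pattern $J = J(w) = \{j, w_j \neq 0\}$ (Yuan &
Lin, 2007):

$$\|(Q_{J^c J} Q_{JJ}^{-1} Q_{J\mathbf{J}} - Q_{J^c \mathbf{J}})\mathbf{w}_{\mathbf{J}} + (Q_{J^c J} Q_{JJ}^{-1} q_J - q_{J^c}) - \mu_n Q_{J^c J} Q_{JJ}^{-1} s_J\|_\infty \le \mu_n, \qquad (4)$$

$$\mathrm{sign}(Q_{JJ}^{-1} Q_{J\mathbf{J}} \mathbf{w}_{\mathbf{J}} + Q_{JJ}^{-1} q_J - \mu_n Q_{JJ}^{-1} s_J) = s_J. \qquad (5)$$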
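The property that a known sign pattern yields the solution in closed form can be checked numerically. The sketch below works directly with the data (X, Y) rather than with the population quantities of Eqs. (4) and (5): it solves the stationarity equation on the hypothesised support and then verifies the sign and subgradient conditions. The function names `lasso_from_sign_pattern` and `satisfies_optimality`, and the tolerance `tol`, are assumptions of this sketch.

```python
import numpy as np

def lasso_from_sign_pattern(X, Y, mu, s):
    """Candidate minimizer of (1/2n)||Y - Xw||^2 + mu*||w||_1 for a sign pattern s.

    On the support J = {j, s_j != 0}: w_J = Q_JJ^{-1} (X_J^T Y / n - mu * s_J).
    """
    n, p = X.shape
    s = np.asarray(s, dtype=float)
    J = np.flatnonzero(s != 0)
    w = np.zeros(p)
    if J.size > 0:
        Q_JJ = X[:, J].T @ X[:, J] / n
        w[J] = np.linalg.solve(Q_JJ, X[:, J].T @ Y / n - mu * s[J])
    return w

def satisfies_optimality(X, Y, mu, s, tol=1e-10):
    """Check that the candidate has signs s on J and bounded correlations off J."""
    n = X.shape[0]
    s = np.asarray(s, dtype=float)
    w = lasso_from_sign_pattern(X, Y, mu, s)
    corr = X.T @ (Y - X @ w) / n               # negative gradient of the square loss
    on = s != 0
    signs_ok = np.array_equal(np.sign(w[on]), s[on])
    bound_ok = bool(np.all(np.abs(corr[~on]) <= mu + tol))
    return signs_ok and bound_ok
```

Only sign patterns for which both checks pass correspond to a global minimum of Eq. (1), which is what Eqs. (4) and (5) express.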

In this paper, we focus on regularization parameters $\mu_n$ of the form $\mu_n = \mu_0 n^{-1/2}$.
The main idea behind the results is to consider that (Q, q) are
distributed according to their limiting distributions, obtained from the law of large
numbers and the central limit theorem, i.e., Q converges to $\mathbf{Q}$ a.s. and $n^{1/2} q$
is asymptotically normally distributed with mean zero and covariance matrix
$\sigma^2 \mathbf{Q}$. When assuming this, Propositions 1 and 2 are straightforward. The main
effort is to make sure that we can safely replace (Q, q) by their limiting
distributions. The following lemmas give sufficient conditions for correct estimation
of the signs of variables in $\mathbf{J}$ and for selecting a given pattern s (note that all
constants could be expressed in terms of $\mathbf{Q}$ and $\mathbf{w}$, details are omitted here):

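Lemma 1 Assume (A2) and $\|Q - \mathbf{Q}\|_2 \le \lambda_{\min}(\mathbf{Q})/2$. Then $\mathrm{sign}(\hat{w}_{\mathbf{J}}) \neq \mathrm{sign}(\mathbf{w}_{\mathbf{J}})$
implies $\|\mathbf{Q}^{-1/2} q\|_2 \ge C_1 - \mu_n C_2$, where $C_1, C_2 > 0$.

Lemma 2 Assume (A2) and let $s \in \{-1, 0, 1\}^p$ such that $s_{\mathbf{J}} = \mathrm{sign}(\mathbf{w}_{\mathbf{J}})$. Let
$J = \{j, s_j \neq 0\} \supset \mathbf{J}$. Assume

$$\|Q - \mathbf{Q}\|_2 \le \min\{\eta_1, \lambda_{\min}(\mathbf{Q})/2\}, \qquad (6)$$

$$\|\mathbf{Q}^{-1/2} q\|_2 \le \min\{\eta_2, C_3 - \mu_n C_4\}, \qquad (7)$$

$$\|\mathbf{Q}_{J^c J} \mathbf{Q}_{JJ}^{-1} q_J - q_{J^c} - \mu_n \mathbf{Q}_{J^c J} \mathbf{Q}_{JJ}^{-1} s_J\|_\infty \le \mu_n - C_5 \eta_1 \mu_n - C_6 \eta_1 \eta_2, \qquad (8)$$

$$\forall i \in J \setminus \mathbf{J}, \quad s_i \left[\mathbf{Q}_{JJ}^{-1}(q_J - \mu_n s_J)\right]_i \ge \mu_n C_7 \eta_1 + C_8 \eta_1 \eta_2, \qquad (9)$$

where $C_3, C_4, \dots, C_8$ are positive constants. Then $\mathrm{sign}(\hat{w}) = s$.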

Those two lemmas are interesting because they relate optimality of certain sign 
patterns to quantities from which we can derive concentration inequalities. 
A.2 Concentration Inequalities

Throughout the proofs, we need to provide upper bounds on the following
quantities: $P(\|\mathbf{Q}^{-1/2} q\|_2 > \alpha)$ and $P(\|Q - \mathbf{Q}\|_2 > \eta)$. We obtain, following standard
arguments (Boucheron et al., 2004): if $\alpha < C_9$ and $\eta < C_{10}$ (where $C_9, C_{10} > 0$
are constants),

P(I|CT 1/2 9||2 >a) ^4pexp(-^) . 

P(||Q-Q|| 2 >r/)^ 2 cxp(-^). 

We also consider multivariate Berry-Esseen inequalities (Bentkus, 2003); the
probability $P(n^{1/2} q \in C)$ can be estimated as $P(t \in C)$ where t is normal with
mean zero and covariance matrix $\sigma^2 \mathbf{Q}$. The error is then uniformly (for all
convex sets C) upper bounded by:

$$400\, p^{1/4} n^{-1/2} \lambda_{\min}(\mathbf{Q})^{-3/2}\, \mathbb{E}|\varepsilon|^3 \|X\|_2^3 = C_{11} n^{-1/2}.$$
A.3 Proof of Proposition 1

By Lemma 2, for any given A, and n large enough, the probability that the sign
is different from s is upper bounded by

$$P\left(\|\mathbf{Q}^{-1/2} q\|_2 > \frac{A (\log n)^{1/2}}{n^{1/2}}\right) + P\left(\|Q - \mathbf{Q}\|_2 > \frac{A (\log n)^{1/2}}{n^{1/2}}\right) + P\{t \notin C(s, \mu_0(1 - \alpha))\} + 2 C_{11} n^{-1/2},$$

where $C(s, \beta)$ is the set of t such that (a) $\|\mathbf{Q}_{J^c J} \mathbf{Q}_{JJ}^{-1} t_J - t_{J^c} - \beta \mathbf{Q}_{J^c J} \mathbf{Q}_{JJ}^{-1} s_J\|_\infty \le \beta$
and (b) for all $i \in J \setminus \mathbf{J}$, $s_i [\mathbf{Q}_{JJ}^{-1}(t_J - \beta s_J)]_i > 0$. Note that here $\alpha =
O((\log n) n^{-1/2})$ tends to zero and that we have: $P\{t \notin C(s, \mu_0(1 - \alpha))\} \le
P\{t \notin C(s, \mu_0)\} + O(\alpha)$. All terms (if A is large enough) are thus $O((\log n) n^{-1/2})$.

This shows that $P(\mathrm{sign}(\hat{w}) = s) \ge \rho(s, \mu_0) + O((\log n) n^{-1/2})$ where
$\rho(s, \mu_0) = P\{t \in C(s, \mu_0)\} \in (0, 1)$; the probability is strictly between 0 and 1
because the set and its complement have non empty interiors and the normal
distribution has a positive definite covariance matrix $\sigma^2 \mathbf{Q}$. The other inequality
can be proved similarly. Note that the constant in $O((\log n) n^{-1/2})$ depends
on $\mu_0$ but by carefully considering this dependence on $\mu_0$, we could make the
inequality uniform in $\mu_0$ as long as $\mu_0$ tends to zero or infinity at most at
a logarithmic speed (i.e., $\mu_n$ deviates from $n^{-1/2}$ by at most a logarithmic
factor). Also, it would be interesting to consider uniform bounds on portions of
the regularization path.

A.4 Proof of Proposition 2

From Lemma 1, the probability of not selecting any of the variables in $\mathbf{J}$ is
upper bounded by $P(\|\mathbf{Q}^{-1/2} q\|_2 > C_1 - \mu_n C_2) + P(\|Q - \mathbf{Q}\|_2 > \lambda_{\min}(\mathbf{Q})/2)$,
which is straightforwardly upper bounded (using Section A.2) by a term of the
required form.
A.5 Proof of Proposition 3

In order to simplify the proof, we make the simplifying assumption that the
random variables X and $\varepsilon$ have compact supports. Extending the proofs to take
into account the looser condition that $\|X\|_2^2$ and $\varepsilon^2$ have non uniformly infinite
cumulant generating functions (i.e., assumption (A1)) can be done with minor
changes. The probability that $\bigcap_{k=1}^m J_k$ is different from $\mathbf{J}$ is upper bounded by
the sum of the following probabilities:

(a) Selecting at least variables in $\mathbf{J}$: the probability that, for the k-th
replication, one index in $\mathbf{J}$ is not selected; each of these probabilities is upper bounded
by $P(\|\mathbf{Q}^{-1/2} q^*\|_2 > C_1/2) + P(\|\mathbf{Q} - Q^*\|_2 > \lambda_{\min}(\mathbf{Q})/2)$, where $q^*$ corresponds
to the ghost sample; as is common in theoretical analysis of the bootstrap, we
relate $q^*$ to q as follows: $P(\|\mathbf{Q}^{-1/2} q^*\|_2 > C_1/2) \le P(\|\mathbf{Q}^{-1/2}(q^* - q)\|_2 >
C_1/4) + P(\|\mathbf{Q}^{-1/2} q\|_2 > C_1/4)$ (and similarly for $P(\|\mathbf{Q} - Q^*\|_2 > \lambda_{\min}(\mathbf{Q})/2)$).
Because we have assumed that X and $\varepsilon$ have compact supports, the
bootstrapped variables also have compact support and we can use concentration
inequalities (given the original variables X, and also after expectation with
respect to X). Thus the probability for one bootstrap replication is upper bounded
by $B e^{-Cn}$ where B and C are strictly positive constants. Thus the overall
contribution of this part is less than $m B e^{-Cn}$.

(b) Selecting at most variables in J: the probability that for all replica- 
tions, the set J is not exactly selected (note that this is not tight at all since 
on top of the relevant variables which are selected with overwhelming probabil- 
ity, different additional variables may be selected for different replications and 
cancel out when intersecting). 

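Our goal is thus to bound $\mathbb{E}\{P(J^* \neq \mathbf{J} | X)^m\}$. By previous lemmas, we
have that $P(J^* \neq \mathbf{J} | X)$ is upper bounded by

$$P\left(\|\mathbf{Q}^{-1/2} q^*\|_2 > \frac{A (\log n)^{1/2}}{n^{1/2}} \,\Big|\, X\right) + P\left(\|\mathbf{Q} - Q^*\|_2 > \frac{A (\log n)^{1/2}}{n^{1/2}} \,\Big|\, X\right) + P(t^* \notin C(\mu_0) | X) + 2 C_{11} n^{-1/2} + O\left(\frac{\log n}{n^{1/2}}\right),$$

where now, given X, Y, $t^*$ is normally distributed with mean $n^{1/2} q$ and covariance
matrix $\frac{1}{n} \sum_{i=1}^n \varepsilon_i^2 x_i x_i^\top$.

The first two terms and the last two ones are uniformly $O\left(\frac{\log n}{n^{1/2}}\right)$ (if A is large
enough). We then have to consider the remaining term. We have $C(\mu_0) = \{t^* \in
\mathbb{R}^p, \ \|\mathbf{Q}_{\mathbf{J}^c \mathbf{J}} \mathbf{Q}_{\mathbf{J}\mathbf{J}}^{-1} t^*_{\mathbf{J}} - t^*_{\mathbf{J}^c} - \mu_0 \mathbf{Q}_{\mathbf{J}^c \mathbf{J}} \mathbf{Q}_{\mathbf{J}\mathbf{J}}^{-1} \mathbf{s}_{\mathbf{J}}\|_\infty \le \mu_0\}$. By Hoeffding's inequality,
we can replace the covariance matrix that depends on X and Y by $\sigma^2 \mathbf{Q}$, at
cost $O(n^{-1/2})$. We thus have to bound $P(n^{1/2} q + y \notin C(\mu_0) | q)$ for y normally
distributed and $C(\mu_0)$ a fixed compact set. Because the set is compact, there
exist constants A, B > 0 such that, if $\|n^{1/2} q\|_2 \le \alpha$ for $\alpha$ large enough, then
$P(n^{1/2} q + y \notin C(\mu_0) | q) \le 1 - A e^{-B\alpha^2}$. Thus, by truncation, we obtain a
bound of the form:

$$\mathbb{E}\{P(J^* \neq \mathbf{J} | X)^m\} \le \left(1 - A e^{-B\alpha^2} + F \frac{\log n}{n^{1/2}}\right)^m + C e^{-B\alpha^2} \le \exp\left(-m A e^{-B\alpha^2} + m F \frac{\log n}{n^{1/2}}\right) + C e^{-B\alpha^2},$$

where we have used Hoeffding's inequality to upper bound $P(\|n^{1/2} q\|_2 > \alpha)$.
By minimizing in closed form with respect to $e^{-B\alpha^2}$, i.e., with
$e^{-B\alpha^2} = \frac{F \log n}{A n^{1/2}} + \frac{\log(mA/C)}{mA}$, we obtain the desired inequality.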
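The closed-form minimization invoked at the end of the proof can be spelled out as follows. This is a sketch rather than the author's own derivation: x stands for $e^{-B\alpha^2}$, and $\delta_n$ is a shorthand introduced here for the n-dependent remainder that multiplies F in the bound; A, C, F and m are the constants of the bound above, and the computation assumes mA > C.

```latex
\begin{align*}
g(x) &= \exp\left(-mAx + mF\delta_n\right) + Cx, \\
g'(x) &= -mA\exp\left(-mAx + mF\delta_n\right) + C = 0
  \;\Longrightarrow\;
  x^\star = \frac{F\delta_n}{A} + \frac{\log(mA/C)}{mA}, \\
g(x^\star) &= \frac{C}{mA} + \frac{CF}{A}\,\delta_n + \frac{C\log(mA/C)}{mA}.
\end{align*}
```

The value at the minimizer is thus a sum of a term of order $\delta_n$ and terms of order $\log(m)/m$, which is the shape of the last two terms in the bound of Proposition 3.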